1) The insurer is expected to monitor its complaints' resolution system (CRS) on a regular basis to "self-identify weaknesses and take corrective action", closing each complaint (irrespective of the reason for the complaint) with or without monetary relief to the policyholder, as "justified".
2) A complaint may be closed with the conclusion that certain informational requirements have to be met by the policyholder, that no action is required on the part of the insurer, or that the complaint itself is ill-motivated.
3) By analyzing the complaints for a given period, an insurer's complaints' resolution system can be rated on a low-medium-high scale of increasing regulatory concern.
a) Thus, low represents the lowest degree of regulatory concern and the highest CRS rating (outstanding compliance);
b) high represents the highest degree of regulatory concern and the lowest CRS rating (poor compliance).
4) The lowest degree of regulatory concern is assigned to an insurer that maintains a strong CRS and takes action to prevent violations of law and consumer harm.
5) The rating of medium is assigned to an insurer that maintains a CRS that is satisfactory at managing complaints and at limiting violations of law and consumer harm.
6) The highest degree of regulatory concern reflects a CRS deficient at managing complaints and/or deficient at preventing violations of law and consumer harm.
7) You are required to identify a complaint as UFDP or otherwise, using key words from the regulations and/or the text forming part of the complaint reason/sub-reason.
8) While the number of complaints involving unfair and deceptive practices (UFDP), as defined under Section 5 (Unfair or Deceptive Acts) of the Federal Trade Commission Act (USA), out of all complaints is an important measure, it is also important for the insurance regulator to keep a tab on the duration of such complaints (the persistence of the violation or consumer harm over a period of time).
9) For the UFDP definition in the context of the US insurance domain, please refer to Sec 4 of the Unfair Trade Practices Act at
https://www.naic.org/store/free/MDL-880.pdf.
10) This is the model law for UFDP in insurance given by the NAIC to the various state regulators. Accordingly, each state has enacted its own Unfair Insurance Practices Act for insurance in particular, and an Unfair Trade Practices Act for all domains in general.
11) For example, Connecticut made insurance policies subject to both the Connecticut Unfair Insurance Practices Act and the Connecticut Unfair Trade Practices Act.
12) For a quick understanding, it is enough to check whether the insurer's practice in any insurance operation, be it sales or claims settlement, offends public policy; is immoral, unethical, oppressive or unscrupulous; or causes substantial injury to policyholders, competitors or other associated business persons.
13) You are expected to create an analytical and modelling framework to predict the DRC (Degree of Regulatory Compliance) of each insurer as "poor (1)", "average (2)" or "outstanding (3)", based on the given data and the above criteria, and also to generate the top 20 patterns for the "poor" target class using decision tree algorithms only, while answering the other questions cited below.
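The low/medium/high concern scale in items 3-6 and the numeric DRC labels in item 13 line up one-to-one; a minimal Python sketch of that mapping (the dictionary name is illustrative):

```python
# Mapping from the degree of regulatory concern (items 3-6) to the DRC
# target labels and numeric codes given in item 13.
concern_to_drc = {
    "low":    ("outstanding", 3),  # strong CRS, lowest regulatory concern
    "medium": ("average", 2),      # satisfactory CRS
    "high":   ("poor", 1),         # deficient CRS, highest regulatory concern
}

label, code = concern_to_drc["high"]  # -> ("poor", 1)
```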
The data provides information about the complaints against the insurers of a state in the USA.
Company              string   Name of the company
FileNo               id       File ID
DateOfRegistration   date     Date of complaint registration
DateOfResolution     date     Date of action/resolution
Coverage             string   Coverage details
SubCoverage          string   Subcoverage details
Reason               string   Category of the complaint (Employer Handling, FOI Inquiry, Marketing/Sales, Other, Premium and Rating, Statute Violation, Underwriting, Unfair Claims Practice, Unknown, Utilization Review)
SubReason            string   Sub-category of the complaint
EnforcementAction    string   Regulatory enforcement action
Conclusion           string   Resolution details (Furnished Information Justified, No Action Necessary, Questionable, Unjustified, Voluntary Reconsider)
RecoveredFromInsurer numeric  Amount recovered
ResolutionStatus     string   Status of resolution (Closed, Open, Re-Opened)
ComplaintID          id       Complaint ID
InsurerID            id       Insurer ID
State                string   State (Connecticut)
DRC                  target   Degree of Regulatory Compliance (target attribute)
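Item 7 above asks for UFDP flagging via key words in the complaint reason/sub-reason text. A minimal sketch; the keyword set below is illustrative only, and a real list should be drawn from Sec 4 of the NAIC model act:

```python
# Illustrative keywords; replace with terms taken from the regulation text.
UFDP_KEYWORDS = {"unfair", "deceptive", "misrepresent", "fraud", "unscrupulous"}

def is_ufdp(reason: str, sub_reason: str = "") -> bool:
    """Flag a complaint as UFDP if any keyword appears in its reason text."""
    text = f"{reason} {sub_reason}".lower()
    return any(kw in text for kw in UFDP_KEYWORDS)

is_ufdp("Unfair Claims Practice")  # True
is_ufdp("Premium and Rating")      # False
```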
# General packages.
import os
import pandas as pd
import numpy as np
from IPython.display import Image
# Packages for visualisation.
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
# Packages for preprocessing.
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Packages for model selection and train-test split.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
# Packages for error metric or model evaluation.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
# Packages for ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb
# Packages for tree
from sklearn import tree
from sklearn.tree import _tree
def func_PlotImageCloud(arr):
    comment_words = ' '
    stopwords = set(STOPWORDS)
    # Build one long string out of all values, lower-cased token by token.
    for val in arr:
        val = str(val)
        tokens = val.split()
        for i in range(len(tokens)):
            tokens[i] = tokens[i].lower()
        for words in tokens:
            comment_words = comment_words + words + ' '
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=stopwords,
                          min_font_size=10).generate(comment_words)
    # Plot the WordCloud image.
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
class MyClass():

    def create_And_Fit_ML_Models(self, X_train, y_train, X_test, y_test, prefix=''):
        '''Fit a fixed set of classifiers, print their evaluation, and return them.'''
        dict_accuracy_score = {}
        dict_model = {}
        # (short key, display name, estimator) triples; the hyperparameters are
        # the same as in the original per-model blocks.
        models = [
            ('NB', 'Naive Bayes Classifier', GaussianNB()),
            ('LR', 'Logistic Regression Classifier', LogisticRegression()),
            ('Knn', 'KNN (k-nearest neighbours) classifier',
             KNeighborsClassifier(n_neighbors=7)),
            ('DT', 'Decision Tree classifier', DecisionTreeClassifier(max_depth=10)),
            ('RF', 'Random Forest classifier',
             RandomForestClassifier(n_estimators=100, bootstrap=True,
                                    max_features='sqrt')),
        ]
        for key, name, model in models:
            strModelName = prefix + ' ' + name
            print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~', strModelName,
                  '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
            # Training the classifier and measuring accuracy on X_test.
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            accuracy = model.score(X_test, y_test)
            print('Accuracy : ', accuracy)
            print('\n')
            dict_accuracy_score[strModelName] = accuracy
            dict_model[key] = model
            # Confusion matrix and classification report.
            print('Confusion Matrix\n')
            print(confusion_matrix(y_test, predictions))
            print('\n')
            print('Classification report\n')
            print(classification_report(y_test, predictions))
            print('\n')
        # Displaying accuracies of all the models in %.
        df_model_scores = pd.DataFrame(list(dict_accuracy_score.items()),
                                       columns=['Model Name', 'Score'])
        df_model_scores.Score = [round(item * 100, 2) for item in df_model_scores.Score]
        print('Model Scores \n')
        print(df_model_scores)
        print('\n')
        # Plotting the accuracies of all the models on a bar graph.
        fig = px.bar(df_model_scores, x="Model Name", y="Score", range_y=[0, 100])
        fig.show()
        return dict_model
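GridSearchCV is imported above but never used; a hedged sketch of how it could tune the decision tree's depth, on synthetic stand-in data with an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the prepared complaint features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Illustrative grid; the values are assumptions, not tuned for the real data.
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid={'max_depth': [3, 5, 10],
                                'min_samples_leaf': [1, 5]},
                    cv=3, scoring='accuracy')
grid.fit(X, y)
best_depth = grid.best_params_['max_depth']
```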
def tree_to_code(tree, feature_names):
    '''
    Outputs a decision tree model as a Python function.

    Parameters:
    -----------
    tree: decision tree model
        The decision tree to represent as a function
    feature_names: list
        The feature names of the dataset used for building the decision tree
    '''
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]

    def recurse(node, depth):
        indent = "    " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)
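scikit-learn also ships sklearn.tree.export_text, which renders the same if/else rule view without custom recursion; a self-contained example on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text prints one '|---' branch per split, mirroring tree_to_code's output.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```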
#loading the first csv file 'Train_Complaints.csv' in pandas data frame 'f_df_complaints'.
f_df_complaints = pd.read_csv('Train_Complaints.csv')
#loading the second csv file 'Train.csv' in pandas data frame 'f_df_drc'.
f_df_drc = pd.read_csv('Train.csv')
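The date columns could also be parsed at load time with parse_dates; a sketch on an inline stand-in for the CSV, with column names as used later in this notebook:

```python
import io
import pandas as pd

# Inline stand-in for Train_Complaints.csv (column names as used later on).
csv_text = """ComplaintID,DateOfRegistration,DateOfResolution
1,2015-01-05,2015-02-10
2,2015-03-01,2015-03-15
"""
df = pd.read_csv(io.StringIO(csv_text),
                 parse_dates=['DateOfRegistration', 'DateOfResolution'])
# Both date columns now carry datetime64 dtype, so later to_datetime calls
# become unnecessary.
```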
# Checking the dimensions of first dataframe.
f_df_complaints.shape
# Checking the dimensions of second dataframe.
f_df_drc.shape
# Displaying first 5 data points of 'f_df_complaints'.
f_df_complaints.head(5)
# Displaying last 5 data points of 'f_df_complaints'.
f_df_complaints.tail(5)
# Displaying first 5 data points of 'f_df_drc'.
f_df_drc.head(5)
# Displaying last 5 data points of 'f_df_drc'.
f_df_drc.tail(5)
# Finding the count of unique values in column 'InsurerID' of 'f_df_drc' dataframe.
print ('InsurerID')
print (len(f_df_drc['InsurerID'].unique()))
# Finding the count of unique values in column 'Company' of 'f_df_complaints' dataframe.
print ('Company')
print (len(f_df_complaints['Company'].unique()))
# Finding the count of unique values in column 'InsurerID' of 'f_df_complaints' dataframe.
print ('InsurerID')
print (len(f_df_complaints['InsurerID'].unique()))
# Extracting the unique values from 'InsurerID' column of 'f_df_drc' into 'arr_insurerID_drc' list.
arr_insurerID_drc = f_df_drc['InsurerID'].unique()
# Extracting the unique values from 'InsurerID' column of 'f_df_complaints' into 'arr_insurerID_complaints' list.
arr_insurerID_complaints = f_df_complaints['InsurerID'].unique()
# Converting both arrays to sets and counting IDs present in 'f_df_drc' but not in 'f_df_complaints'.
len(list(set(arr_insurerID_drc) - set(arr_insurerID_complaints)))
# Converting both arrays to sets and counting IDs present in 'f_df_complaints' but not in 'f_df_drc'.
len(list(set(arr_insurerID_complaints) - set(arr_insurerID_drc)))
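The two set differences above audit the join coverage; pandas can report the same in one pass with merge(..., indicator=True). A self-contained sketch on toy frames:

```python
import pandas as pd

left = pd.DataFrame({'InsurerID': [1, 2, 3]})
right = pd.DataFrame({'InsurerID': [2, 3, 4]})

# indicator=True adds a '_merge' column marking where each row came from.
audit = left.merge(right, on='InsurerID', how='outer', indicator=True)
counts = audit['_merge'].value_counts()
# counts['left_only']  -> IDs only in the first frame
# counts['right_only'] -> IDs only in the second frame
```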
# Merging the two data frames 'f_df_complaints' and 'f_df_drc' by doing Inner join and
# storing them in 'data_original' dataframe. Joining them on the basis of 'InsurerID'column.
data_original = pd.merge(f_df_complaints, f_df_drc, on='InsurerID', how='inner')
# Storing a copy of 'data_original' in 'raw_data' for backup purposes.
raw_data = data_original.copy()
# Checking the dimensions of 'data_original' dataframe.
data_original.shape
# Displaying first 5 data points of 'data_original'.
data_original.head(5)
# Removing the data points having NAs in 'DateOfResolution' column of 'data_original' dataframe and
# storing it in 'df_wo_NAsIn_DateOfResolution' dataframe.
df_wo_NAsIn_DateOfResolution = data_original.dropna(axis=0, subset=['DateOfResolution'])
# Subtracting 'DateOfRegistration' from 'DateOfResolution' and storing the
# integer day difference in the 'No_of_DaysToResolve' column.
df_wo_NAsIn_DateOfResolution['No_of_DaysToResolve'] = (
    pd.to_datetime(df_wo_NAsIn_DateOfResolution.DateOfResolution,
                   format='%Y-%m-%d', errors='coerce')
    - pd.to_datetime(df_wo_NAsIn_DateOfResolution.DateOfRegistration,
                     format='%Y-%m-%d', errors='coerce')
).dt.days.astype('int32')
df_wo_NAsIn_DateOfResolution[['No_of_DaysToResolve', 'DateOfResolution', 'DateOfRegistration']].head(5)
# Plotting a scatter plot of complaints vs. the number of days taken to resolve them.
plt.figure(figsize = (100, 30))
plt.ylim(0,500)
plt.scatter(df_wo_NAsIn_DateOfResolution['ComplaintID'], df_wo_NAsIn_DateOfResolution['No_of_DaysToResolve'],
alpha=0.2, s=1000, marker='o')
plt.xticks([])
plt.xlabel('Complaints', fontsize=100)
plt.ylabel('Number of days taken to resolve', fontsize=100);
plt.show()
# Plotting a scatter plot (via plotly) of complaints vs. days taken to resolve, coloured by DRC.
# It is a zoomed-in, interactive version of the above plot.
fig = px.scatter(df_wo_NAsIn_DateOfResolution, x="ComplaintID", y="No_of_DaysToResolve", color="DRC",
size='ComplaintID', hover_data=['No_of_DaysToResolve'], range_y=[0,500])
fig.update_layout(title='Effect of No. of days required to resolve vs DRC class')
fig.show()
# Extract only 'poor' DRC class from the 'data_original' dataframe and store it in 'df_DRC_poor' dataframe.
df_DRC_poor = data_original.loc[data_original.DRC == 'poor']
df_DRC_poor.head(1)
df_DRC_poor.isnull().sum()
# Replacing all NAs with a blank space ''.
df_DRC_poor = df_DRC_poor.replace(np.nan, '', regex=True)
df_DRC_poor.isnull().sum()
# Plotting the word cloud of the 'Coverage' column of only those records who are having 'poor' class in 'DRC' column.
func_PlotImageCloud(df_DRC_poor.Coverage)
# Plotting the word cloud of the 'SubCoverage' column of only those records who are having 'poor' class in 'DRC' column.
func_PlotImageCloud(df_DRC_poor.SubCoverage)
# Plotting the word cloud of the 'Reason' column of only those records who are having 'poor' class in 'DRC' column.
func_PlotImageCloud(df_DRC_poor.Reason)
# Plotting the word cloud of the 'SubReason' column of only those records who are having 'poor' class in 'DRC' column.
func_PlotImageCloud(df_DRC_poor.SubReason)
# Displaying the count of Unique values in all columns of 'data_original' dataframe.
for i in data_original.columns.values:
    print(i, ' ---> ', len(data_original[i].unique()))
# Using plotly to plot the histogram of the 'DRC' class.
fig = px.histogram(data_original, x="DRC")
fig.show()
# Using plotly to plot the histogram of 'Reason', coloured by DRC.
fig = px.histogram(data_original, x="Reason", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'ResolutionStatus', coloured by DRC.
fig = px.histogram(data_original, x="ResolutionStatus", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'Conclusion', coloured by DRC.
fig = px.histogram(data_original, x="Conclusion", color="DRC")
fig.show()
# Plotting the amount recovered from the insurer vs. DRC class (y-axis capped at 1500).
fig = px.scatter(raw_data, x="ComplaintID", y="RecoveredFromInsurer", color="DRC",
hover_data=['RecoveredFromInsurer'], range_y=[0,1500])
fig.update_layout(title='Effect of Amount Recovered From Insurer vs DRC class')
fig.show()
data_original.isnull().sum()
# Displaying NA counts for only those columns that contain NAs.
null_columns=data_original.columns[data_original.isnull().any()]
data_original[null_columns].isnull().sum()
# Displaying the DRC's class counts.
print (pd.value_counts(data_original['DRC'].values))
# Storing the names of all the columns of 'data_original' dataframe in 'all_columns' array.
all_columns = data_original.columns
# Displaying the percentage of rows having NAs in both 'Coverage' and 'SubCoverage' columns.
print('Percentage of NAs in both \'Coverage\' and \'SubCoverage\' column is ',
((len(pd.merge(data_original[data_original["Coverage"].isnull()][all_columns],
data_original[data_original["SubCoverage"].isnull()][all_columns],
how='inner')) / len(data_original) ) * 100), '%.')
df_temp_Cov_N_SubCov = pd.merge(data_original[data_original["Coverage"].isnull()][all_columns],
data_original[data_original["SubCoverage"].isnull()][all_columns],
how='inner')
df_temp_Cov_N_SubCov.to_csv("Temp_Coverage_SubCoverage.csv", sep=",", header=True)
colors_pie_chart = ["#E13F29", "#D69A80", "#D63B59"]
labels = 'poor','average','outstanding'
fig = plt.figure(figsize=[14,10])
# Counting the frequency of DRC for each unique value in Original data i.e. 'data_original' and plotting it on a pie-chart.
df_pie_plot_DRC_FullData = pd.value_counts(data_original['DRC'].values)
ax1 = fig.add_axes([0, 0, .30, .30], aspect=1)
ax1.pie(df_pie_plot_DRC_FullData, labels=labels, colors=colors_pie_chart, autopct='%1.1f%%', startangle=0, explode=(0.15, 0, 0))
# Counting the frequency of DRC for each unique value in those data where 'Coverage' and 'Sub_Coverage' columns were having NAs i.e. 'df_temp_Cov_N_SubCov' and plotting it on a pie-chart.
df_pie_plot_DRC_Cov_N_SubCov = pd.value_counts(df_temp_Cov_N_SubCov['DRC'].values)
ax2 = fig.add_axes([.35, .0, .30, .30], aspect=1)
ax2.pie(df_pie_plot_DRC_Cov_N_SubCov, labels=labels, colors=colors_pie_chart, autopct='%1.1f%%', startangle=0, explode=(0.15, 0, 0))
# Keeping only those rows of 'data_original' that have no NAs in both 'Coverage'
# and 'SubCoverage': appending the NA rows twice and dropping all duplicates
# (keep=False) removes them from the result.
df_temp_data_after_removing_NA = pd.concat(
    [data_original, df_temp_Cov_N_SubCov, df_temp_Cov_N_SubCov]
).drop_duplicates(keep=False)
# Counting the frequency of DRC for each unique value after removing those rows which were having NAs in 'Coverage' and 'Sub_Coverage' columns i.e. 'df_temp_data_after_removing_NA' and plotting it on a pie-chart.
df_pie_plot_temp_data_after_removing_NA = pd.value_counts(df_temp_data_after_removing_NA['DRC'].values)
ax3 = fig.add_axes([.70, .0, .30, .30], aspect=1)
ax3.pie(df_pie_plot_temp_data_after_removing_NA, labels=labels, colors=colors_pie_chart, autopct='%1.1f%%', startangle=0, explode=(0.15, 0, 0))
ax1.set_title('Complete Data DRC Spread')
ax2.set_title('NA in Cov & Sub-Cov DRC Spread')
ax3.set_title('Data after removing NA from Cov & Sub-Cov')
plt.show()
data_original = df_temp_data_after_removing_NA
null_columns=data_original.columns[data_original.isnull().any()]
data_original[null_columns].isnull().sum()
data_original.loc[data_original['DateOfResolution'].isnull(), ['DateOfResolution']] = '2099-12-31'
null_columns=data_original.columns[data_original.isnull().any()]
data_original[null_columns].isnull().sum()
data_original.isnull().sum()
print(( data_original['SubCoverage'].isnull().sum()/len(data_original) ) * 100 )
data_original = data_original.drop('SubCoverage', axis=1)
all_columns = data_original.columns
# Recomputing 'No_of_DaysToResolve' on the cleaned dataframe.
data_original['No_of_DaysToResolve'] = (
    pd.to_datetime(data_original.DateOfResolution,
                   format='%Y-%m-%d', errors='coerce')
    - pd.to_datetime(data_original.DateOfRegistration,
                     format='%Y-%m-%d', errors='coerce')
).dt.days.astype('int32')
data_original[['No_of_DaysToResolve', 'DateOfResolution', 'DateOfRegistration']].head(5)
print('-------------------------------')
print(data_original.DRC.value_counts())
drc_perc = round((data_original.DRC.value_counts(sort=False)/sum(data_original.DRC.value_counts()) * 100),2)
print('-------------------------------')
print(drc_perc)
print('-------------------------------')
drc_perc.plot( kind='bar', figsize = (6,6), color = ['blue', 'green', 'lightseagreen'], alpha = 0.4, fontsize = 14 )
plt.ylim([0,100])
plt.xlabel('DRC', fontsize = 14)
plt.ylabel('Percentage', fontsize = 14)
plt.title('DRC Rate (in percentage)', fontsize = 20)
for i, v in enumerate(drc_perc):
    plt.text(i - 0.2, v + 3, str(v) + "%", color='black', fontweight='bold')
plt.show()
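With class percentages this unbalanced, a stratified train-test split keeps each DRC class's share in both partitions; a sketch on synthetic labels (the 70/20/10 split is illustrative, not the real data's ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/20/10 split, mimicking an unbalanced DRC column.
y = np.array(['poor'] * 70 + ['average'] * 20 + ['outstanding'] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
# The test partition keeps the same 70/20/10 class proportions.
```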
months = {
1: 'January',
2: 'February',
3: 'March',
4: 'April',
5: 'May',
6: 'June',
7: 'July',
8: 'August',
9: 'September',
10: 'October',
11: 'November',
12: 'December'
}
weekdays = {
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}
# Extracting the day, month, year and day-of-week from the DateOfResolution column
# and storing them in 'DtOf_Reso_Date', 'DtOf_Reso_Month', 'DtOf_Reso_Year' and 'DtOf_Reso_Week_Day' respectively.
data_original['DtOf_Reso_Date'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).day
data_original['DtOf_Reso_Month'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).month
data_original['DtOf_Reso_Week_Day'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).dayofweek
data_original['DtOf_Reso_Year'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).year
# Converting the month and weekday codes into readable names using the dictionaries above.
data_original.DtOf_Reso_Month = [months[item] for item in data_original.DtOf_Reso_Month]
data_original.DtOf_Reso_Week_Day = [weekdays[item] for item in data_original.DtOf_Reso_Week_Day]
data_original[['DateOfResolution','DtOf_Reso_Date','DtOf_Reso_Month', 'DtOf_Reso_Week_Day','DtOf_Reso_Year']].head(5)
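As an aside, the month and weekday dictionaries above can be replaced by the built-in pandas .dt.month_name() and .dt.day_name() accessors; a self-contained sketch:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2016-01-04', '2016-07-15']))
month_names = dates.dt.month_name()  # 'January', 'July'
day_names = dates.dt.day_name()      # 'Monday', 'Friday'
```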
# Extracting the day, month, year and day-of-week from the DateOfRegistration column
# and storing them in 'DtOf_Regi_Date', 'DtOf_Regi_Month', 'DtOf_Regi_Year' and 'DtOf_Regi_Week_Day' respectively.
data_original['DtOf_Regi_Date'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfRegistration'],format='%Y-%m-%d', errors='coerce')).day
data_original['DtOf_Regi_Month'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfRegistration'],format='%Y-%m-%d', errors='coerce')).month
data_original['DtOf_Regi_Week_Day'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfRegistration'],format='%Y-%m-%d', errors='coerce')).dayofweek
data_original['DtOf_Regi_Year'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfRegistration'],format='%Y-%m-%d', errors='coerce')).year
# Converting the month and weekday codes into readable names using the dictionaries above.
data_original.DtOf_Regi_Month = [months[item] for item in data_original.DtOf_Regi_Month]
data_original.DtOf_Regi_Week_Day = [weekdays[item] for item in data_original.DtOf_Regi_Week_Day]
data_original[['DateOfRegistration','DtOf_Regi_Date','DtOf_Regi_Month', 'DtOf_Regi_Week_Day','DtOf_Regi_Year']].head(5)
# Using plotly to plot the histogram of 'DtOf_Regi_Year'.
fig = px.histogram(data_original, x="DtOf_Regi_Year", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Regi_Month'.
fig = px.histogram(data_original, x="DtOf_Regi_Month", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Reso_Week_Day'.
fig = px.histogram(data_original, x="DtOf_Regi_Date", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Regi_Week_Day'.
fig = px.histogram(data_original, x="DtOf_Regi_Week_Day", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Reso_Year'.
fig = px.histogram(data_original, x="DtOf_Reso_Year", color="DRC")
fig.show()
# The same histogram of 'DtOf_Reso_Year', excluding the sentinel resolution year 2099.
df_temp_without2099 = data_original.loc[data_original.DtOf_Reso_Year != 2099]
fig = px.histogram(df_temp_without2099, x="DtOf_Reso_Year", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Reso_Month'.
fig = px.histogram(data_original, x="DtOf_Reso_Month", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Reso_Date'.
fig = px.histogram(data_original, x="DtOf_Reso_Date", color="DRC")
fig.show()
# Using plotly to plot the histogram of 'DtOf_Reso_Week_Day'.
fig = px.histogram(data_original, x="DtOf_Reso_Week_Day", color="DRC")
fig.show()
# Restarting the preprocessing from the complaints file alone.
data_original = f_df_complaints
# Displaying all the columns from 'data_original'.
data_original.columns
# Displaying the datatypes of all the columns in 'data_original' dataframe.
data_original.dtypes
# Displaying the count of Unique values in all columns of 'data_original' dataframe.
for i in data_original.columns.values:
    print(i, ' ---> ', len(data_original[i].unique()))
# Displaying the NAs count in all the columns of 'data_original'.
data_original.isnull().sum()
# Removing the data points having NAs in 'DateOfResolution' column of 'data_original' dataframe and
# storing it in 'df_wo_NAsIn_DateOfResolution' dataframe.
df_wo_NAsIn_DateOfResolution = data_original.dropna(axis=0, subset=['DateOfResolution'])
# Displaying the NAs count in all the columns of 'df_wo_NAsIn_DateOfResolution'.
df_wo_NAsIn_DateOfResolution.isnull().sum()
# Subtracting 'DateOfRegistration' from 'DateOfResolution' and storing the
# integer day difference in the 'No_of_DaysToResolve' column.
df_wo_NAsIn_DateOfResolution['No_of_DaysToResolve'] = (
    pd.to_datetime(df_wo_NAsIn_DateOfResolution.DateOfResolution,
                   format='%Y-%m-%d', errors='coerce')
    - pd.to_datetime(df_wo_NAsIn_DateOfResolution.DateOfRegistration,
                     format='%Y-%m-%d', errors='coerce')
).dt.days.astype('int32')
df_wo_NAsIn_DateOfResolution[['No_of_DaysToResolve', 'DateOfResolution', 'DateOfRegistration']].head(5)
# Dropping columns 'ComplaintID' and 'State' from 'data_original'.
columns_temp_unwanted = ['ComplaintID', 'State']
data_original = data_original.drop(columns_temp_unwanted, axis=1)
data_original.shape
columns_temp = ['Company', 'FileNo', 'Coverage', 'SubCoverage', 'Reason', 'SubReason',
'EnforcementAction', 'Conclusion', 'ResolutionStatus', 'InsurerID']
for i in columns_temp:
    print(i)
    print(pd.value_counts(data_original[i].values))
# Displaying the summary of 'data_original'.
data_original.describe(include='all')
# Displaying the summary of 'data_original' i.e. not-null(Not NAs) count and datatypes of each column.
data_original.info()
data_original.isnull().sum()
# Displaying NA counts for only those columns that contain NAs.
null_columns=data_original.columns[data_original.isnull().any()]
data_original[null_columns].isnull().sum()
all_columns = data_original.columns
# Displaying the percentage of rows having NAs in both 'Coverage' and 'SubCoverage' columns.
print('Percentage of NAs in both \'Coverage\' and \'SubCoverage\' column is ',
((len(pd.merge(data_original[data_original["Coverage"].isnull()][all_columns],
data_original[data_original["SubCoverage"].isnull()][all_columns],
how='inner')) / len(data_original) ) * 100), '%.')
df_temp_Cov_N_SubCov = pd.merge(data_original[data_original["Coverage"].isnull()][all_columns],
data_original[data_original["SubCoverage"].isnull()][all_columns],
how='inner')
df_temp_Cov_N_SubCov.to_csv("Temp_Coverage_SubCoverage.csv", sep=",", header=True)
# Keeping only those rows of 'data_original' that have no NAs in both 'Coverage'
# and 'SubCoverage': appending the NA rows twice and dropping all duplicates
# (keep=False) removes them from the result.
df_temp_data_after_removing_NA = pd.concat(
    [data_original, df_temp_Cov_N_SubCov, df_temp_Cov_N_SubCov]
).drop_duplicates(keep=False)
data_original = df_temp_data_after_removing_NA
data_original.shape
data_original.head(5)
null_columns=data_original.columns[data_original.isnull().any()]
data_original[null_columns].isnull().sum()
# Displaying all rows that have NAs in the 'DateOfResolution' column.
data_original[data_original["DateOfResolution"].isnull()][all_columns]
data_original.loc[data_original['DateOfResolution'].isnull(), ['DateOfResolution']] = '2099-12-31'
null_columns=data_original.columns[data_original.isnull().any()]
data_original[null_columns].isnull().sum()
data_original.isnull().sum()
print(( data_original['SubCoverage'].isnull().sum()/len(data_original) ) * 100 )
data_original = data_original.drop('SubCoverage', axis=1)
all_columns = data_original.columns
data_original.shape
# Recomputing 'No_of_DaysToResolve' on the cleaned dataframe.
data_original['No_of_DaysToResolve'] = (
    pd.to_datetime(data_original.DateOfResolution,
                   format='%Y-%m-%d', errors='coerce')
    - pd.to_datetime(data_original.DateOfRegistration,
                     format='%Y-%m-%d', errors='coerce')
).dt.days.astype('int32')
data_original[['No_of_DaysToResolve', 'DateOfResolution', 'DateOfRegistration']].head(5)
for i in data_original.columns.values:
    print(i, ' ---> ', len(data_original[i].unique()))
data_original.dtypes
months = {
1: 'January',
2: 'February',
3: 'March',
4: 'April',
5: 'May',
6: 'June',
7: 'July',
8: 'August',
9: 'September',
10: 'October',
11: 'November',
12: 'December'
}
weekdays = {
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}
# Extracting the day, month, year and day-of-week from the DateOfResolution column
# and storing them in 'DtOf_Reso_Date', 'DtOf_Reso_Month', 'DtOf_Reso_Year' and 'DtOf_Reso_Week_Day' respectively.
data_original['DtOf_Reso_Date'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).day
data_original['DtOf_Reso_Month'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).month
data_original['DtOf_Reso_Week_Day'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).dayofweek
data_original['DtOf_Reso_Year'] = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfResolution'],format='%Y-%m-%d', errors='coerce')).year
# Converting the MonthCode and WeekDay code in redable format according to above dictionaries and storing it.
data_original.DtOf_Reso_Month = [months[item] for item in data_original.DtOf_Reso_Month]
data_original.DtOf_Reso_Week_Day = [weekdays[item] for item in data_original.DtOf_Reso_Week_Day]
data_original[['DateOfResolution','DtOf_Reso_Date','DtOf_Reso_Month', 'DtOf_Reso_Week_Day','DtOf_Reso_Year']].head(5)
# Extracting day, month, year and day-of-week from the 'DateOfRegistration' column
# into 'DtOf_Regi_Date', 'DtOf_Regi_Month', 'DtOf_Regi_Year' and 'DtOf_Regi_Week_Day' respectively.
dt_regi = pd.DatetimeIndex(pd.to_datetime(data_original['DateOfRegistration'], format='%Y-%m-%d', errors='coerce'))
data_original['DtOf_Regi_Date'] = dt_regi.day
data_original['DtOf_Regi_Month'] = dt_regi.month
data_original['DtOf_Regi_Week_Day'] = dt_regi.dayofweek
data_original['DtOf_Regi_Year'] = dt_regi.year
# Converting the month and weekday codes into readable names using the dictionaries above.
data_original.DtOf_Regi_Month = [months[item] for item in data_original.DtOf_Regi_Month]
data_original.DtOf_Regi_Week_Day = [weekdays[item] for item in data_original.DtOf_Regi_Week_Day]
data_original[['DateOfRegistration','DtOf_Regi_Date','DtOf_Regi_Month', 'DtOf_Regi_Week_Day','DtOf_Regi_Year']].head(5)
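As an aside, the `months` and `weekdays` dictionaries could be avoided entirely: pandas exposes readable names directly through the `.dt` accessor. A small sketch on a toy series:

```python
import pandas as pd

# Toy series showing the built-in alternative to the month/weekday lookup dicts.
s = pd.to_datetime(pd.Series(["2021-01-04", "2021-12-31"]))
print(s.dt.month_name().tolist())  # ['January', 'December']
print(s.dt.day_name().tolist())    # ['Monday', 'Friday']
```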
data_original.shape
data_original.isnull().sum()
data_original.describe(include='all')
columns_temp_unwanted = ['Company', 'FileNo', 'DateOfRegistration', 'DateOfResolution','DtOf_Reso_Date','DtOf_Reso_Year',
'DtOf_Regi_Date','DtOf_Regi_Year', 'DtOf_Reso_Month','DtOf_Reso_Week_Day', 'DtOf_Regi_Month',
'DtOf_Regi_Week_Day']
data_original = data_original.drop(columns_temp_unwanted, axis=1)
data_original.shape
all_columns = data_original.columns
all_columns
columns_numeric = ['No_of_DaysToResolve']
columns_numeric
for col in columns_numeric:
    data_original[col] = data_original[col].astype('int16')
data_original.dtypes
columns_categorical = ['Coverage', 'Reason', 'SubReason', 'EnforcementAction', 'Conclusion', 'ResolutionStatus']
columns_categorical
for col in columns_categorical:
    data_original[col] = data_original[col].astype('category')
data_original.dtypes
# data_original['DRC'] = data_original['DRC'].map({'poor': 1, 'average': 2, 'outstanding': 3})
# data_original.head(5)
data_original.tail(5)
my_OG_data_backup = data_original
data_original.shape
data_original.columns
data_original_ENC = pd.get_dummies(columns = columns_categorical, data = data_original,
prefix = columns_categorical, prefix_sep="_", drop_first=False)
data_original_ENC.shape
data_original_ENC = data_original_ENC.rename(columns={'Coverage_Fire, Allied': 'Coverage_Fire_Allied',
'Coverage_Worker\'s Compensation':'Coverage_Workers Compensation',
'SubReason_Carrier Never Rec\'d Appl':'SubReason_Carrier Never Recd Appl',
'SubReason_Carrier Never Rec\'d Claim':'SubReason_Carrier Never Recd Claim',
'Coverage_Other [Enter Coverage]':'Coverage_Other_Enter_Coverage',
'SubReason_Other [Enter Sub-Reason]':'SubReason_Other_Enter_Sub-Reason',
'EnforcementAction_Other [Enter Disposition]':'EnforcementAction_Other_Enter_Disposition'
})
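The dummification step above expands each categorical column into prefixed 0/1 indicator columns. A minimal sketch on a hypothetical one-column frame:

```python
import pandas as pd

# Toy frame illustrating the get_dummies call used above.
toy = pd.DataFrame({"Reason": ["Delay", "Denial", "Delay"]})
enc = pd.get_dummies(toy, columns=["Reason"], prefix="Reason",
                     prefix_sep="_", drop_first=False)
print(sorted(enc.columns))                       # ['Reason_Delay', 'Reason_Denial']
print(enc["Reason_Delay"].astype(int).tolist())  # [1, 0, 1]
```

With `drop_first=False` every level gets its own column, which is what lets the per-insurer counting below work by simple summation.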
# Displaying the count of unique values in all columns of the 'data_original_ENC' dataframe.
for i in data_original_ENC.columns.values:
    print(i, ' ---> ', len(data_original_ENC[i].unique()))
all_cols_after_dummification = data_original_ENC.columns
len(all_cols_after_dummification)
# Count of unique 'InsurerID' values in the encoded dataframe.
print ('InsurerID')
print (len(data_original_ENC['InsurerID'].unique()))
arr_Unique_InsurerID = data_original_ENC['InsurerID'].unique()
len(arr_Unique_InsurerID)
# Count of unique 'InsurerID' values in the target dataframe 'f_df_drc', for comparison.
print ('InsurerID')
print (len(f_df_drc['InsurerID'].unique()))
arr_Coverage_ColNames = [i for i in all_cols_after_dummification if i.startswith('Coverage_')]
arr_Reason_ColNames = [i for i in all_cols_after_dummification if i.startswith('Reason_')]
arr_SubReason_ColNames = [i for i in all_cols_after_dummification if i.startswith('SubReason_')]
arr_EnforcementAction_ColNames = [i for i in all_cols_after_dummification if i.startswith('EnforcementAction_')]
arr_Conclusion_ColNames = [i for i in all_cols_after_dummification if i.startswith('Conclusion_')]
arr_ResolutionStatus_ColNames = [i for i in all_cols_after_dummification if i.startswith('ResolutionStatus_')]
final_column_list = ['InsurerID', 'RecoveredFromInsurer', 'No_of_DaysToResolve'] + arr_Coverage_ColNames + arr_Reason_ColNames + arr_SubReason_ColNames + arr_EnforcementAction_ColNames + arr_Conclusion_ColNames + arr_ResolutionStatus_ColNames
df_master_facts = pd.DataFrame(columns=final_column_list)
df_master_facts
# Initialising one all-zero row per insurer; the row length must match 'final_column_list'.
npArrZeros = np.zeros(len(final_column_list))
print("Arr Length : ", len(npArrZeros))
for insurerId in arr_Unique_InsurerID:
    df_master_facts.loc[insurerId] = npArrZeros
df_master_facts.head()
df_master_facts.shape
We have already dummified all the categorical columns, which means n columns are created for n distinct reasons: a complaint has 1 in the column for the reason it was filed under and 0 in every other reason column. The same holds for the other categorical columns: Reason, Sub-Reason, Coverage, Enforcement-Action, Conclusion and Resolution-Status.
For each insurer we count the occurrences of every Reason, Sub-Reason, Coverage, Enforcement-Action, Conclusion and Resolution-Status across its complaints and divide by that insurer's number of complaints, so the values are normalised per insurer.
This tells us how much each Reason, Sub-Reason, Coverage, Enforcement-Action, Conclusion and Resolution-Status contributes to deciding the DRC class of each insurer.
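The per-insurer normalisation described above, the share of complaints in which each dummy column is 1, is equivalent to a groupby mean. A sketch on a toy frame with hypothetical column names:

```python
import pandas as pd

# Toy dummified complaints: two insurers, two reason indicator columns.
df = pd.DataFrame({
    "InsurerID": [1, 1, 1, 2, 2],
    "Reason_Delay": [1, 0, 1, 0, 0],
    "Reason_Denial": [0, 1, 0, 1, 1],
})
# Mean of a 0/1 column per group = count of 1s divided by number of complaints.
shares = df.groupby("InsurerID").mean()
print(shares.loc[1, "Reason_Delay"])   # 2 of insurer 1's 3 complaints cite Delay
```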
# For each insurer, computing the share of its complaints in which each dummy column is 1.
for insurerId in arr_Unique_InsurerID:
    df_temp_By_InsurerID = data_original_ENC[data_original_ENC['InsurerID'] == insurerId]
    # df_master_facts.ix[insurerId, 'InsurerID'] = insurerId  # .ix is deprecated; use .loc
    df_master_facts.loc[[insurerId], ['InsurerID']] = insurerId
    totalCount = len(df_temp_By_InsurerID['InsurerID'])
    for colName in final_column_list[1:]:
        df_master_facts.loc[[insurerId], [colName]] = (df_temp_By_InsurerID[colName].sum()) / totalCount
df_master_facts.head()
df_master_facts.shape
# Finding the count of unique values in the 'InsurerID' column of the 'f_df_drc' dataframe.
print ('InsurerID')
print (len(f_df_drc['InsurerID'].unique()))
# Finding the count of unique values in the 'InsurerID' column of the 'df_master_facts' dataframe.
print ('InsurerID')
print (len(df_master_facts['InsurerID'].unique()))
# Extracting the unique values from 'InsurerID' column of 'f_df_drc' into 'arr_insurerID_drc' list.
arr_insurerID_drc = f_df_drc['InsurerID'].unique()
# Extracting the unique values from the 'InsurerID' column of 'df_master_facts' into 'arr_insurerID_master_facts'.
arr_insurerID_master_facts = df_master_facts['InsurerID'].unique()
# Converting both arrays to sets and checking the difference in each direction:
# insurers with a DRC label but no facts row ...
len(list(set(arr_insurerID_drc) - set(arr_insurerID_master_facts)))
# ... and insurers with a facts row but no DRC label.
len(list(set(arr_insurerID_master_facts) - set(arr_insurerID_drc)))
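The two-way set-difference check is a quick way to confirm that two ID columns cover exactly the same population before an inner join. A toy illustration with hypothetical IDs:

```python
# Toy ID sets: one insurer has a DRC label but no facts row.
ids_drc = {101, 102, 103, 104}
ids_facts = {101, 102, 103}

print(sorted(ids_drc - ids_facts))   # [104] -> labelled but no facts row
print(sorted(ids_facts - ids_drc))   # []    -> every facts row has a label
```

Only when both differences are empty will an inner join keep every row from both sides.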
# Merging 'df_master_facts' and 'f_df_drc' with an inner join on the 'InsurerID' column
# and storing the result in 'df_master_facts_with_Y'.
df_master_facts_with_Y = pd.merge(df_master_facts, f_df_drc, on='InsurerID', how='inner')
# Checking the dimensions of the 'df_master_facts_with_Y' dataframe.
df_master_facts_with_Y.shape
# Displaying the first 5 rows of 'df_master_facts_with_Y'.
df_master_facts_with_Y.head(5)
# Displaying the datatypes of all the columns in the 'df_master_facts_with_Y' dataframe.
df_master_facts_with_Y.dtypes
df_master_facts_with_Y.columns
# Encoding the target DRC as ordinal codes: poor = 1, average = 2, outstanding = 3.
df_master_facts_with_Y['DRC'] = df_master_facts_with_Y['DRC'].map({'poor': 1, 'average': 2, 'outstanding': 3})
df_master_facts_with_Y.head(5)
df_master_facts_with_Y['DRC'] = df_master_facts_with_Y['DRC'].astype('category')
df_master_facts_with_Y.dtypes
df_master_facts_with_Y = df_master_facts_with_Y.drop('InsurerID', axis=1)
df_master_facts_with_Y.shape
plt.figure(figsize=(10,10))
sns.heatmap(df_master_facts_with_Y.corr())
plt.show()
X, y = df_master_facts_with_Y.loc[:,df_master_facts_with_Y.columns!='DRC'], df_master_facts_with_Y.loc[:,'DRC']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state = 0)
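`stratify=y` matters here because the three DRC classes are unlikely to be balanced: it makes both splits keep the class proportions of the full data. A sketch with toy labels (hypothetical class sizes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: classes 1, 2, 3 in a 1:3:6 ratio.
y_toy = np.array([1] * 10 + [2] * 30 + [3] * 60)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, ytr, yte = train_test_split(X_toy, y_toy, stratify=y_toy,
                                  test_size=0.25, random_state=0)
print(np.bincount(yte)[1:])  # per-class test counts keep the 1:3:6 ratio
```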
X_train.head(2)
dict_models_wo_standardization = MyClass().create_And_Fit_ML_Models(X_train, y_train, X_test, y_test, 'No Standardization -')
scaleSS = StandardScaler()
scaleSS.fit(X_train)
X_train_stdScal = scaleSS.transform(X_train)
X_test_stdScal = scaleSS.transform(X_test)
dict_models_StandardScalar = MyClass().create_And_Fit_ML_Models(X_train_stdScal, y_train, X_test_stdScal, y_test, 'StdScaler -')
scaleMMS = MinMaxScaler()
scaleMMS.fit(X_train)
X_train_stdMinMaxScal = scaleMMS.transform(X_train)
X_test_stdMinMaxScal = scaleMMS.transform(X_test)
dict_models_StandardMinMaxScalar = MyClass().create_And_Fit_ML_Models(X_train_stdMinMaxScal, y_train, X_test_stdMinMaxScal, y_test, 'MinMaxScaler -')
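Both scaling blocks above follow the same contract: fit the scaler on the training split only, then apply that fitted transform to both splits, so no statistics from the test set leak into training. A minimal sketch with toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 1-D feature; min/max are learned from the training split only.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

mms = MinMaxScaler().fit(train)
# Test values are mapped with the *training* min/max, so they may fall outside [0, 1].
print(mms.transform(test).ravel().tolist())  # [0.25, 1.2]
```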
dict_AdvModels_AccuracyScore = {}
param_grid_dt = {'criterion': ['entropy'],
'max_depth': [6,8,10,12],
'max_features':['log2']}
dt_adv = DecisionTreeClassifier()
dt_grid = GridSearchCV(dt_adv, param_grid=param_grid_dt, cv=5)
dt_grid.fit(X_train_stdMinMaxScal,y_train)
dt_best = dt_grid.best_estimator_
print(f"The best parameters are {dt_grid.best_params_}")
test_pred = dt_best.predict(X_test_stdMinMaxScal)
advModelsAccuracy = accuracy_score(y_test, test_pred)
print(f"Accuracy - Test data \n{advModelsAccuracy}")
dict_AdvModels_AccuracyScore['Decision Tree Classifier'] = advModelsAccuracy
# tree_to_code(dt_adv, list(X.columns))
param_grid_rf = {"n_estimators" : [380, 400, 410],
"max_depth" : [12, 15, 17],
"max_features" : [45, 48, 50],
"min_samples_leaf" : [6, 7, 8]}
# Accuracy = 80.00%
# {'max_depth': 15, 'max_features': 48, 'min_samples_leaf': 6, 'n_estimators': 400}
#param_grid_rf = {"n_estimators" : [395, 397, 400],
# "max_depth" : [15, 16],
# "max_features" : [47, 48, 49],
# "min_samples_leaf" : [5, 6, 7]}
# Accuracy = 80.00%
# {'max_depth': 16, 'max_features': 48, 'min_samples_leaf': 6, 'n_estimators': 395}
rf_adv = RandomForestClassifier()
clf_rf_adv = GridSearchCV(rf_adv, param_grid_rf, cv=5)
clf_rf_adv.fit(X_train_stdMinMaxScal, y_train)
print(f"The best parameters combination for Random Forest are {clf_rf_adv.best_params_}")
test_pred = clf_rf_adv.predict(X_test_stdMinMaxScal)
advModelsAccuracy = accuracy_score(y_test, test_pred)
print(f"Accuracy - Test data \n{advModelsAccuracy}")
dict_AdvModels_AccuracyScore['Random Forest'] = advModelsAccuracy
# param_grid_xgbm = {'learning_rate':[0.1,0.5],
# 'n_estimators': [20],
# 'subsample': [0.3,0.9]}
# Accuracy = 80.74%
# {'learning_rate': 0.5, 'n_estimators': 20, 'subsample': 0.9}
param_grid_xgbm = {'learning_rate':[0.5, 0.4, 0.3],
'n_estimators': [20, 30],
'subsample': [0.3,0.9]}
# Accuracy = 80.74%
# {'learning_rate': 0.5, 'n_estimators': 20, 'subsample': 0.9}
xgbm_adv = xgb.XGBClassifier()
xgbm_grid = GridSearchCV(xgbm_adv, param_grid_xgbm, cv=5)
xgbm_grid.fit(X_train_stdMinMaxScal, y_train)
xgbm_best = xgbm_grid.best_estimator_
print(f"The best parameters are {xgbm_grid.best_params_}")
test_pred = xgbm_best.predict(X_test_stdMinMaxScal)
advModelsAccuracy = accuracy_score(y_test, test_pred)
print(f"Accuracy - Test data \n{advModelsAccuracy}")
dict_AdvModels_AccuracyScore['XGBoost'] = advModelsAccuracy
param_grid_gbm = {'max_depth': [8,10,12,14],
'subsample': [0.8, 0.6,],
'max_features':[0.2, 0.3],
'n_estimators': [10, 20, 30]}
# Accuracy = 80.74%
#{'max_depth': 12, 'max_features': 0.2, 'n_estimators': 20, 'subsample': 0.8}
# param_grid_gbm = {'max_depth': [12,13,14],
# 'subsample': [0.8, 0.7,],
# 'max_features':[0.2, 0.3],
# 'n_estimators': [20, 25, 30]}
# Accuracy = 79.25%
# {'max_depth': 13, 'max_features': 0.3, 'n_estimators': 25, 'subsample': 0.8}
# param_grid_gbm = {'max_depth': [12,13,14],
# 'subsample': [0.8, 0.9,],
# 'max_features':[0.3, 0.4],
# 'n_estimators': [25, 27, 30]}
# Accuracy = 77.77%
# {'max_depth': 13, 'max_features': 0.3, 'n_estimators': 25, 'subsample': 0.8}
gbm_adv = GradientBoostingClassifier()
gbm_grid = GridSearchCV(gbm_adv, param_grid=param_grid_gbm, cv=5)
gbm_grid.fit(X_train_stdMinMaxScal,y_train)
gbm_best = gbm_grid.best_estimator_
print(f"The best parameters are {gbm_grid.best_params_}")
test_pred = gbm_best.predict(X_test_stdMinMaxScal)
advModelsAccuracy = accuracy_score(y_test, test_pred)
print(f"Accuracy - Test data \n{advModelsAccuracy}")
dict_AdvModels_AccuracyScore['Gradient Boost Classifier'] = advModelsAccuracy
param_grid_ada = {'n_estimators': [20,40,60], 'learning_rate':[0.5,1.0]}
base_estimator = DecisionTreeClassifier(criterion='gini', max_depth=10)
# Note: scikit-learn >= 1.2 renames the 'base_estimator' parameter to 'estimator'.
ada_adv = AdaBoostClassifier(base_estimator=base_estimator)
ada_grid = GridSearchCV(ada_adv, param_grid=param_grid_ada, cv=5)
ada_grid.fit(X_train_stdMinMaxScal,y_train)
ada_best = ada_grid.best_estimator_
print(f"The best parameters are {ada_grid.best_params_}")
test_pred = ada_best.predict(X_test_stdMinMaxScal)
advModelsAccuracy = accuracy_score(y_test, test_pred)
print(f"Accuracy - Test data \n{advModelsAccuracy}")
dict_AdvModels_AccuracyScore['AdaBoost Classifier'] = advModelsAccuracy
df_Adv_ModelScores = pd.DataFrame(list(dict_AdvModels_AccuracyScore.items()), columns=['Model Name', 'Score'])
df_Adv_ModelScores.Score = [round((item*100),2) for item in df_Adv_ModelScores.Score]
print('Model Scores \n')
print(df_Adv_ModelScores)
print('\n')
# Plotting and displaying Accuracies of all the models on a bar graph.
fig = px.bar(df_Adv_ModelScores, x="Model Name", y="Score", range_y=[0,100])
fig.show()